Let’s explore white wines!!
White wines get no love. At least in my experience, friends and dinner dates are more likely to jump for the Cabs and Pinots than for a nice Sauvignon Blanc or Riesling. But perhaps that is more a matter of approachability and a lack of understanding. So let’s dive into the data, and maybe we can unearth some of the secrets of what makes a good white wine, and understand why white wines are amazing!
First I’m going to explore the structure:
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
There are roughly 5,000 wines, each with 12 variables plus a row index (X). I think a good starting point is the distribution of quality. As the only integer variable besides the index, and our likely dependent variable, it’s worth getting an idea of how quality is distributed in this dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
There are many mediocre wines: 5 and 6 are by far the most common ratings. But we also have 5 great wines (rated 9) and a good mix of decent wines rated 7 and 8.
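The counts above are enough to rebuild the quality column on its own, which makes the shape of the distribution easy to double-check. Here is a minimal sketch in base R. The names `quality_counts`, `quality`, and `mid_share` are ones I’m introducing for illustration, and the counts are copied from the output above.

```r
# Counts per quality score, copied from the table() output above.
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)

# Rebuild the quality column: one entry per wine.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

summary(quality)  # compare with the summary() output above
table(quality)    # compare with the table() output above

# Share of wines rated 5 or 6, the "mediocre" middle.
mid_share <- sum(quality_counts[c("5", "6")]) / length(quality)
round(mid_share, 3)
```

Roughly three quarters of the wines sit at 5 or 6, which is worth keeping in mind for everything that follows.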
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
For the sake of comparison, the density of water is 1.0 g/cm^3, ethanol (alcohol) is 0.7893 g/cm^3, and sugar (glucose and fructose) ranges from 1.54 to 1.69 g/cm^3. We would expect wines to be less dense than water because they contain alcohol, hence most densities fall below 1.0. The few wines that are at or above 1.0 should have a higher sugar content. This could be fun to look at. With that in mind, I want to look at the alcohol and sugar data.
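To make that reasoning concrete, here is a rough sketch of how alcohol and sugar pull density in opposite directions. It rests on several assumptions of mine: alcohol is treated as percent by volume, sugar density is fixed at a round 1.6 g/cm^3, volumes are assumed to simply add up (they don’t exactly for ethanol and water), and everything else dissolved in the wine is ignored. `mix_density` is a hypothetical helper, not something from the dataset.

```r
# Densities in g/cm^3 quoted in the text above.
rho_water   <- 1.0
rho_ethanol <- 0.7893
rho_sugar   <- 1.6  # assumption: a round value inside the 1.54 to 1.69 range

# Crude density estimate for 1 litre made only of water, ethanol, and sugar,
# assuming the component volumes are additive.
mix_density <- function(abv_percent, sugar_g_per_l) {
  ethanol_ml <- 1000 * abv_percent / 100
  sugar_ml   <- sugar_g_per_l / rho_sugar
  water_ml   <- 1000 - ethanol_ml - sugar_ml
  mass_g <- ethanol_ml * rho_ethanol + sugar_g_per_l + water_ml * rho_water
  mass_g / 1000  # grams per 1000 cm^3, i.e. g/cm^3
}

mix_density(10.5, 6.4)   # roughly the mean alcohol and sugar in this dataset
mix_density(8.0, 65.8)   # lowest alcohol with the highest sugar
mix_density(14.2, 0.6)   # highest alcohol with the lowest sugar
```

The absolute numbers come out lower than the densities actually observed, so this is only good for the direction of the effect: more alcohol lowers the density, and it takes a lot of sugar to push it above 1.0.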
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol percentage is pretty straightforward. It’s telling us how much alcohol is in the wine. The median and mean are around 10.4 and 10.5 respectively, and the distribution is skewed to the right, with the majority of wines in the 9-11 range.
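The right skew can be sanity-checked from the quartiles alone. This is a small sketch using the quartile (Bowley) skewness, which is my choice of measure rather than anything used in the original analysis; `quartile_skew` is a hypothetical helper.

```r
# Quartile (Bowley) skewness: positive means the upper half of the
# distribution is more spread out than the lower half, i.e. a right skew.
quartile_skew <- function(q1, med, q3) (q3 + q1 - 2 * med) / (q3 - q1)

# Quartiles copied from the summary() output for alcohol above.
alcohol_skew <- quartile_skew(q1 = 9.5, med = 10.4, q3 = 11.4)
alcohol_skew
```

The value is positive but small, which fits a mean that sits only slightly above the median.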
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Residual Sugar is sugar that has not been fermented by the yeast and bacteria into alcohol and other compounds. I set up a second graph to look at the tail, and it appears there are not that many outliers. I would expect these high residual sugar wines to also have the higher densities as graphed above.
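One common way to put a number on "outlier" is Tukey’s rule, which flags anything more than 1.5 interquartile ranges above the third quartile. That rule is an assumption on my part, not necessarily how the tail was judged here. Using only the summary values above:

```r
# Quartiles and maximum copied from the residual sugar summary() output.
q1 <- 1.7
q3 <- 9.9
max_sugar <- 65.8

iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr  # Tukey's upper fence

upper_fence
max_sugar > upper_fence        # is the maximum beyond the fence?
```

The maximum sits far beyond the fence, so at least one wine is a genuine extreme on sweetness.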
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH is just telling us about the acidity of the wine. The lower the number, the more acidic the wine. I would expect this to be highly correlated with fixed and volatile acidity, as well as citric acid.
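Because pH is a base-10 logarithmic scale, a small difference in pH is a large difference in acidity. A quick sketch using the minimum and maximum from the summary above (`h_conc` is a helper name I’m introducing):

```r
# pH is the negative base-10 logarithm of the hydrogen ion concentration,
# so the concentration in mol/L is 10^(-pH).
h_conc <- function(ph) 10^(-ph)

ph_min <- 2.72  # most acidic wine in the summary above
ph_max <- 3.82  # least acidic wine in the summary above

# How many times more hydrogen ions the most acidic wine has.
acidity_ratio <- h_conc(ph_min) / h_conc(ph_max)
acidity_ratio
```

So the most acidic wine in the sample has more than ten times the hydrogen ion concentration of the least acidic one, even though the pH values look close.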
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Fixed Acidity is the measure in grams per liter of tartaric acid in the wine (fun fact: 1 decimeter cubed is equal to the volume of a liter). It is called a fixed acid, or nonvolatile, because it is difficult to separate from the wine. Although there are many different acids that contribute to the taste of the wine, such as malic and succinic, the paper authors chose to test for tartaric acid. Wines typically have more malic acid. Interestingly, wines from cool climates are typically higher in acidity than those from warmer places.
Anyway, it looks like the fixed acidity is typically in the 6.0 - 8.0 g/dm^3 range.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
There is much less volatile acidity in these wines, which is a good thing. The researchers were testing for the presence of acetic acid, which is typically produced by acetic acid bacteria that convert alcohol and glucose into acetic acid. Acetic acid is the main component of vinegar. There is usually some acetic acid naturally present due to the byproducts of natural yeast and bacteria that live on the grapes. In higher concentrations, acetic acid is a sign of spoilage, meaning the fermentation may not have gone well. Typically, winemakers will add potassium sulphate, which we’ll see more of later, to introduce sulfur dioxide (free sulfur dioxide), which acts as a preservative by killing these bacteria. As it is, most of this sample is well below 0.5 g/dm^3 (the third quartile is 0.32), although the maximum is 1.1 g/dm^3. I don’t expect this variable to be very important to the wine’s quality, as the sensory threshold is roughly 0.6 to 0.9 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Citric acid, as far as the internet is telling me, is an additive for creating a “fresh” or “crisp” taste to the wines. It looks like most of the wines have a concentration of around 0.3 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Interestingly, there is some salt present in the wines (measured here as chlorides). I believe this is a natural byproduct: the grapes produce salt as they grow. For the most part, the concentrations seem fairly minor, so it’ll be interesting to see if this has an impact on the quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Potassium Sulphate is added to wine as a preservative. Fun fact: there is no scientific evidence that sulphites present in wine (and ALL wines have sulphites present) cause headaches. Anyway, potassium sulphate dissolves in the wine, producing both free and bound sulfur dioxide. Free sulfur dioxide in high concentrations contributes to a gassy smell, whereas bound sulfur dioxide binds to yeast and bacteria, acts as an antioxidant, and binds to heavy compounds to preserve and mellow out the wine. The relevant charts for both are below.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
I think it’s good that, in general, we see total sulfur dioxide in higher concentrations than free sulfur dioxide. Other than that, I don’t have much else to add here. It’s going to be more interesting to see the relationship between the different variables, and how they relate to quality.
There are 4,898 different white wines with an integer variable quality and 11 continuous variables testing for the presence of various physical compounds in the wine. Of those variables, I think it’s fair to group certain variables with each other:
I expect chlorides to be a non-factor when it comes to quality, due to the relatively low concentrations in all the wines. I may create some new variables out of the groups above, depending on the results of the bivariate analysis below.
For the most part, the data appears to be fairly clean, so I’m not expecting any issues with moving forward.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.000 -0.023 0.289
## volatile.acidity -0.023 1.000 -0.149
## citric.acid 0.289 -0.149 1.000
## residual.sugar 0.089 0.064 0.094
## chlorides 0.023 0.071 0.114
## free.sulfur.dioxide -0.049 -0.097 0.094
## total.sulfur.dioxide 0.091 0.089 0.121
## density 0.265 0.027 0.150
## pH -0.426 -0.032 -0.164
## sulphates -0.017 -0.036 0.062
## alcohol -0.121 0.068 -0.076
## quality -0.114 -0.195 -0.009
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.089 0.023 -0.049
## volatile.acidity 0.064 0.071 -0.097
## citric.acid 0.094 0.114 0.094
## residual.sugar 1.000 0.089 0.299
## chlorides 0.089 1.000 0.101
## free.sulfur.dioxide 0.299 0.101 1.000
## total.sulfur.dioxide 0.401 0.199 0.616
## density 0.839 0.257 0.294
## pH -0.194 -0.090 -0.001
## sulphates -0.027 0.017 0.059
## alcohol -0.451 -0.360 -0.250
## quality -0.098 -0.210 0.008
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity 0.091 0.265 -0.426 -0.017 -0.121
## volatile.acidity 0.089 0.027 -0.032 -0.036 0.068
## citric.acid 0.121 0.150 -0.164 0.062 -0.076
## residual.sugar 0.401 0.839 -0.194 -0.027 -0.451
## chlorides 0.199 0.257 -0.090 0.017 -0.360
## free.sulfur.dioxide 0.616 0.294 -0.001 0.059 -0.250
## total.sulfur.dioxide 1.000 0.530 0.002 0.135 -0.449
## density 0.530 1.000 -0.094 0.074 -0.780
## pH 0.002 -0.094 1.000 0.156 0.121
## sulphates 0.135 0.074 0.156 1.000 -0.017
## alcohol -0.449 -0.780 0.121 -0.017 1.000
## quality -0.175 -0.307 0.099 0.054 0.436
## quality
## fixed.acidity -0.114
## volatile.acidity -0.195
## citric.acid -0.009
## residual.sugar -0.098
## chlorides -0.210
## free.sulfur.dioxide 0.008
## total.sulfur.dioxide -0.175
## density -0.307
## pH 0.099
## sulphates 0.054
## alcohol 0.436
## quality 1.000
So the highest correlations are between density and both residual sugar (0.839) and alcohol (-0.780), which makes sense as I explained above. Free Sulfur Dioxide is decently correlated with Total Sulfur Dioxide (0.616), although I would have expected a much stronger relationship. And, for the most part, nothing is highly correlated, positively or negatively, with quality or with anything else. This might make the rest of the analysis kind of difficult.
This ggplot does a good job of showing just how uncorrelated a bunch of the variables are. I am going to try removing some outlying values, and see if that makes any of the variables more correlated.
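Picking the strongest pairs out of a matrix that size by eye is error-prone, so here is a sketch of ranking pairs by absolute correlation. To keep it self-contained it uses only four of the variables, with values copied from the correlation table above; `vars`, `r`, and `ranked` are names I’m introducing.

```r
# A 4 x 4 slice of the correlation matrix, values copied from the table above.
vars <- c("residual.sugar", "density", "alcohol", "quality")
r <- matrix(c( 1.000,  0.839, -0.451, -0.098,
               0.839,  1.000, -0.780, -0.307,
              -0.451, -0.780,  1.000,  0.436,
              -0.098, -0.307,  0.436,  1.000),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once (upper triangle), then sort by absolute correlation.
idx <- which(upper.tri(r), arr.ind = TRUE)
ranked <- data.frame(var1 = vars[idx[, "row"]],
                     var2 = vars[idx[, "col"]],
                     r    = r[idx])
ranked <- ranked[order(-abs(ranked$r)), ]
ranked
```

The same approach should work on the full matrix returned by `cor()`, assuming the data frame contains only numeric columns.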
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 775 775 9.1 0.27 0.45 10.6
## 821 821 6.6 0.36 0.29 1.6
## 828 828 7.4 0.24 0.36 2.0
## 877 877 6.9 0.36 0.34 4.2
## 1606 1606 7.1 0.26 0.49 2.2
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 775 0.035 28 124 0.99700 3.20
## 821 0.021 24 85 0.98965 3.41
## 828 0.031 27 139 0.99055 3.28
## 877 0.018 57 119 0.98980 3.28
## 1606 0.032 31 113 0.99030 3.37
## sulphates alcohol quality
## 775 0.46 10.4 9
## 821 0.61 12.4 9
## 828 0.48 12.5 9
## 877 0.36 12.7 9
## 1606 0.42 12.9 9
## 99.9%
## 1.002466
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.000 -0.027 0.292
## volatile.acidity -0.027 1.000 -0.162
## citric.acid 0.292 -0.162 1.000
## residual.sugar 0.085 0.045 0.087
## chlorides 0.024 0.070 0.118
## free.sulfur.dioxide -0.049 -0.102 0.103
## total.sulfur.dioxide 0.087 0.084 0.123
## density 0.268 0.002 0.152
## pH -0.429 -0.034 -0.167
## sulphates -0.020 -0.038 0.066
## alcohol -0.123 0.066 -0.088
## quality -0.110 -0.194 -0.011
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.085 0.024 -0.049
## volatile.acidity 0.045 0.070 -0.102
## citric.acid 0.087 0.118 0.103
## residual.sugar 1.000 0.089 0.324
## chlorides 0.089 1.000 0.103
## free.sulfur.dioxide 0.324 0.103 1.000
## total.sulfur.dioxide 0.415 0.201 0.610
## density 0.832 0.261 0.320
## pH -0.201 -0.090 -0.006
## sulphates -0.029 0.016 0.058
## alcohol -0.463 -0.360 -0.259
## quality -0.100 -0.211 0.025
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity 0.087 0.268 -0.429 -0.020 -0.123
## volatile.acidity 0.084 0.002 -0.034 -0.038 0.066
## citric.acid 0.123 0.152 -0.167 0.066 -0.088
## residual.sugar 0.415 0.832 -0.201 -0.029 -0.463
## chlorides 0.201 0.261 -0.090 0.016 -0.360
## free.sulfur.dioxide 0.610 0.320 -0.006 0.058 -0.259
## total.sulfur.dioxide 1.000 0.550 0.001 0.132 -0.458
## density 0.550 1.000 -0.100 0.073 -0.806
## pH 0.001 -0.100 1.000 0.156 0.121
## sulphates 0.132 0.073 0.156 1.000 -0.018
## alcohol -0.458 -0.806 0.121 -0.018 1.000
## quality -0.165 -0.317 0.099 0.056 0.438
## quality
## fixed.acidity -0.110
## volatile.acidity -0.194
## citric.acid -0.011
## residual.sugar -0.100
## chlorides -0.211
## free.sulfur.dioxide 0.025
## total.sulfur.dioxide -0.165
## density -0.317
## pH 0.099
## sulphates 0.056
## alcohol 0.438
## quality 1.000
I don’t think there’s an appreciable difference in correlation as a result of removing some of the outlying values. I am going to proceed by using the original dataset.
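To put a number on "no appreciable difference", the quality columns of the two correlation tables can be compared directly. This sketch only uses values printed above; `before`, `after`, and `change` are names I’m introducing.

```r
vars <- c("fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar",
          "chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide",
          "density", "pH", "sulphates", "alcohol")

# Correlation of each variable with quality, copied from the two tables above.
before <- c(-0.114, -0.195, -0.009, -0.098, -0.210, 0.008,
            -0.175, -0.307, 0.099, 0.054, 0.436)
after  <- c(-0.110, -0.194, -0.011, -0.100, -0.211, 0.025,
            -0.165, -0.317, 0.099, 0.056, 0.438)

change <- setNames(after - before, vars)

# Variable whose correlation with quality moved the most.
change[which.max(abs(change))]
```

The largest shift is under 0.02 in absolute terms, which supports sticking with the original dataset.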
I’d like to start by looking at the relationship between fixed and volatile acidity, and free and total sulfur dioxide.
This graph shows no discernible relationship between Fixed and Volatile Acidity. I thought that the two might be related, seeing as they’re both measuring acidity, but that doesn’t seem to be the case. Let’s move on to sulfur dioxide.
I set limits on the x axis to expand the graph slightly, which really does a good job of showing the relationship between free and total sulfur dioxide. I would have expected this to be more linear, since free sulfur dioxide is a part of total sulfur dioxide, so I’m a bit surprised to see how much the sulfur dioxide levels vary across the wines.
I’d like to look at the relationship between residual sugar, alcohol and density, but I think I’m going to save that till the multivariate analysis section. Instead, I’m going to explore the relationship of some of the variables with quality. First alcohol.
Interestingly enough, it almost looks like the less alcohol a wine has, the more likely it is to be rated lower. I wonder if that is a factor of taste, or if the raters were displeased by drinking “weaker” wines. As a related check, let’s look at density against quality.
OK, so for the most part, I think a similar pattern has emerged: higher quality wines are generally less dense. Most wines are in the 0.99 to 1.00 range, and I’m not sure what this translates to taste-wise. This is almost the same conclusion as above (higher alcohol), just restated, since the more alcohol a wine has, the less dense it will be.
So, I did a little more research, and fixed acidity (tartaric acid in this case) is what gives a wine its sourness. I wanted to see if there was anything we could understand about the quality from its sourness. But it looks like the sourness of wines comes in different levels, and there doesn’t seem to be anything that really distinguishes higher quality wines from the lesser ones. Let’s look at volatile acidity levels too, as an analogue for spoilage.
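The five wines rated 9 that were printed earlier give a quick, if tiny, illustration of both patterns. This sketch compares their alcohol and density to the overall means from the summary output; `top` and `overall` are names I’m introducing, and five wines are far too few to prove anything.

```r
# The five wines rated 9, values copied from the rows printed earlier.
top <- data.frame(
  alcohol = c(10.4, 12.4, 12.5, 12.7, 12.9),
  density = c(0.99700, 0.98965, 0.99055, 0.98980, 0.99030)
)

# Overall means, copied from the summary() output.
overall <- c(alcohol = 10.51, density = 0.9940)

# Positive means the top wines are higher than average, negative means lower.
top_vs_overall <- colMeans(top) - overall
top_vs_overall
```

On average the top-rated wines have more alcohol and a lower density than the dataset as a whole, though one of the five sits right around the average alcohol level.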
In general, it appears as though the same trend emerges: all wines at all qualities display pretty varying levels of volatile acidity.
Looking at sweetness, it appears as though most wines are relatively not sweet. The threshold that distinguishes a dry wine is about 5 g/dm^3. There are a lot of wines easily below this (as represented by the bold areas in the 0 - 5 range), but also a healthy amount of wines above. However, it does appear as though there is a cap, almost, with increasing levels of quality linked to lower caps on residual sugar. As an example, the cap for a 6 quality wine looks to be around 18, with a cap of 15 for 7 quality, and a cap at 14 for 8 quality. There are not enough data points for the 9th level of quality to make any notes.
I wanted to see if there was any pattern between the sweetness and the sourness of wines, but they seem to be mostly uncorrelated.
Interestingly, there almost seems to be a channel with regards to free and total sulfur dioxide in the wine. The density of the points almost makes it seem as though the sweet spot for sulfur dioxide is around 25-50 mg/dm^3 free and 100-150 mg/dm^3 total.
This sort of shows what I’m talking about.
With the levels so low, and so heavily bunched at the bottom, I don’t think there’s much to learn here.
If we really stretch the boxplot’s y axis, it looks like the lower the salt content, the better the wine.
Citric Acid, weirdly, has an almost similar dynamic to quality as sulfur dioxide, where there is an almost optimal amount at around 0.33 or so.
And finally, just to see the results, pH and sulphates look completely uncorrelated with quality.
I would really like to see more examples of 3, 4, 8, and 9 quality wines. I think that would make it a lot easier to see the relationships between quality and the other variables. That being said, alcohol percentage and density do a good job of telling you about the quality. As for the other variables, I think there are subtle relationships, but it’s hard to tell if they are meaningful, especially considering the low correlations, and the lack of data for higher and lower quality levels.
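Here is a quick look at just how thin the extremes are, using the quality counts from the univariate section. The names `quality_counts`, `shares`, and `tail_share` are mine.

```r
# Counts per quality score, copied from the table() output earlier.
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)

# Percentage of wines at each quality level.
shares <- round(100 * prop.table(quality_counts), 1)
shares

# Combined share of the levels I would like to see more of.
tail_share <- sum(quality_counts[c("3", "4", "8", "9")]) / sum(quality_counts)
round(tail_share, 3)
```

Scores of 3, 4, 8, and 9 together make up well under a tenth of the data, so any pattern at those levels rests on very few wines.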
Alright, so adding alcohol as a color makes it really apparent how the higher quality wines typically have higher alcohol contents.
I really wanted to see sugar and acidity colored by alcohol. There’s a subtle relationship where the less sugar and acidity, the more alcohol. The sugar part makes sense, meaning the yeast did its job well. But the acidity part is really interesting here. I also colored by quality, but there was absolutely no pattern.
I think this does a fantastic job of showing the relationship between residual sugar, alcohol, and density.
I’m not sure there’s much of anything here.
I think this relationship is pretty apparent: the more sulphates in the wine, the more free sulfur dioxide, and consequently, the more total sulfur dioxide. Nothing of note here.
Nothing emerges here either.
This does a good job of showing the relationship between density, alcohol, and quality. But since alcohol percentage and density are so closely related (in the physical sense), I don’t think this graph is very helpful.
Not helpful.
Based on the plots, I thought I might take a stab at creating a linear model.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + density, data = wine)
## m3: lm(formula = quality ~ alcohol + density + residual.sugar, data = wine)
## m4: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides,
## data = wine)
## m5: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides +
## fixed.acidity, data = wine)
## m6: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides +
## fixed.acidity + volatile.acidity, data = wine)
## m7: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides +
## fixed.acidity + volatile.acidity + free.sulfur.dioxide, data = wine)
## m8: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides +
## fixed.acidity + volatile.acidity + free.sulfur.dioxide +
## total.sulfur.dioxide, data = wine)
## m9: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides +
## fixed.acidity + volatile.acidity + free.sulfur.dioxide +
## total.sulfur.dioxide + sulphates, data = wine)
## m10: lm(formula = quality ~ alcohol + density + residual.sugar + chlorides +
## fixed.acidity + volatile.acidity + free.sulfur.dioxide +
## total.sulfur.dioxide + sulphates + pH, data = wine)
##
## ================================================================================================================================================
##                                   m1          m2          m3          m4          m5          m6          m7          m8          m9         m10
## ------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept)                 2.582***  -22.492***   90.313***   87.563***   72.800***   51.143***   50.940***   49.162***   74.922***  149.901***
##                              (0.098)     (6.165)    (12.374)    (12.392)    (14.224)    (13.784)    (13.741)    (14.268)    (14.880)    (18.760)
## alcohol                     0.313***    0.360***    0.246***    0.237***    0.252***    0.306***    0.313***    0.313***    0.282***    0.194***
##                              (0.009)     (0.015)     (0.018)     (0.018)     (0.020)     (0.019)     (0.019)     (0.019)     (0.020)     (0.024)
## density                                24.728***  -87.886***  -84.931***  -69.981***  -48.106***  -48.147***   -46.346**  -72.393*** -149.987***
##                                          (6.079)    (12.317)    (12.340)    (14.222)    (13.783)    (13.740)    (14.280)    (14.907)    (19.029)
## residual.sugar                                      0.053***    0.052***    0.046***    0.044***    0.041***    0.040***    0.050***    0.081***
##                                                      (0.005)     (0.005)     (0.006)     (0.005)     (0.005)     (0.006)     (0.006)     (0.008)
## chlorides                                                       -1.776**   -1.852***      -0.794      -0.926      -0.922      -0.852      -0.234
##                                                                  (0.555)     (0.556)     (0.540)     (0.539)     (0.539)     (0.537)     (0.543)
## fixed.acidity                                                                -0.033*    -0.049**    -0.042**    -0.042**      -0.026     0.066**
##                                                                              (0.015)     (0.015)     (0.015)     (0.015)     (0.015)     (0.021)
## volatile.acidity                                                                       -2.064***   -1.993***   -1.983***   -1.939***   -1.868***
##                                                                                          (0.110)     (0.110)     (0.112)     (0.112)     (0.112)
## free.sulfur.dioxide                                                                                 0.004***    0.004***    0.004***    0.004***
##                                                                                                      (0.001)     (0.001)     (0.001)     (0.001)
## total.sulfur.dioxide                                                                                              -0.000      -0.000      -0.000
##                                                                                                                  (0.000)     (0.000)     (0.000)
## sulphates                                                                                                                   0.590***    0.632***
##                                                                                                                              (0.101)     (0.100)
## pH                                                                                                                                      0.684***
##                                                                                                                                          (0.105)
## ------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared                      0.190       0.192       0.210       0.212       0.213       0.266       0.270       0.271       0.276       0.282
## adj. R-squared                 0.190       0.192       0.210       0.211       0.212       0.265       0.269       0.269       0.274       0.280
## sigma                          0.797       0.796       0.787       0.787       0.786       0.759       0.757       0.757       0.754       0.751
## F                           1146.395     583.290     434.085     328.736     264.067     295.080     259.001     226.616     206.650     191.810
## p                              0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000
## Log-likelihood             -5839.391   -5831.127   -5776.812   -5771.696   -5769.463   -5598.011   -5582.291   -5582.183   -5564.958   -5543.767
## Deviance                    3112.257    3101.773    3033.737    3027.406    3024.647    2820.137    2802.092    2801.969    2782.330    2758.359
## AIC                        11684.782   11670.255   11563.624   11555.391   11552.927   11212.022   11182.582   11184.367   11151.916   11111.534
## BIC                        11704.272   11696.241   11596.107   11594.371   11598.403   11263.995   11241.051   11249.333   11223.378   11189.493
## N                               4898        4898        4898        4898        4898        4898        4898        4898        4898        4898
## ================================================================================================================================================
I had an m11 with citric.acid, but it added no value. As it stands, the fit does improve as variables are added, and every model is significant overall (p of 0.000), but the R-squared is just too low to consider any of these models predictive.
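The AIC and BIC rows in the table can be reproduced from the log-likelihood row, which helps show which additions actually earn their keep. This sketch uses the standard definitions and the m7 and m8 columns above; the parameter count (intercept, slopes, and the residual variance) is my assumption about how the table was computed, and it does match the printed values to within rounding.

```r
n <- 4898  # number of wines

# Log-likelihoods copied from the m7 and m8 columns of the table above.
loglik <- c(m7 = -5582.291, m8 = -5582.183)

# Parameters: intercept + slopes + residual variance.
k <- c(m7 = 1 + 7 + 1, m8 = 1 + 8 + 1)

aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + log(n) * k

round(aic, 3)
round(bic, 3)
```

Going from m7 to m8 adds total.sulfur.dioxide, and both criteria get slightly worse, so that variable adds nothing once free sulfur dioxide is already in the model.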
##
## Calls:
## n1: lm(formula = quality ~ alcohol, data = wine)
## n2: lm(formula = quality ~ alcohol + residual.sugar, data = wine)
## n3: lm(formula = quality ~ alcohol + residual.sugar + chlorides,
## data = wine)
## n4: lm(formula = quality ~ alcohol + residual.sugar + chlorides +
## volatile.acidity, data = wine)
## n5: lm(formula = quality ~ alcohol + residual.sugar + chlorides +
## volatile.acidity + free.sulfur.dioxide, data = wine)
## n6: lm(formula = quality ~ alcohol + residual.sugar + chlorides +
## volatile.acidity + free.sulfur.dioxide + sulphates, data = wine)
## n7: lm(formula = quality ~ alcohol + residual.sugar + chlorides +
## volatile.acidity + free.sulfur.dioxide + sulphates + pH,
## data = wine)
##
## ====================================================================================================
##                                 n1         n2         n3         n4         n5         n6         n7
## ----------------------------------------------------------------------------------------------------
## (Intercept)               2.582***   2.021***   2.276***   2.464***   2.257***   2.044***   1.187***
##                            (0.098)    (0.117)    (0.135)    (0.131)    (0.135)    (0.143)    (0.271)
## alcohol                   0.313***   0.354***   0.339***   0.368***   0.374***   0.375***   0.374***
##                            (0.009)    (0.010)    (0.011)    (0.011)    (0.011)    (0.011)    (0.011)
## residual.sugar                       0.022***   0.021***   0.026***   0.023***   0.023***   0.025***
##                                       (0.002)    (0.003)    (0.002)    (0.002)    (0.002)    (0.003)
## chlorides                                      -2.062***     -0.907     -1.056    -1.076*     -0.939
##                                                  (0.556)    (0.540)    (0.539)    (0.538)    (0.538)
## volatile.acidity                                          -2.086***  -2.010***  -1.998***  -1.996***
##                                                             (0.110)    (0.110)    (0.110)    (0.110)
## free.sulfur.dioxide                                                   0.004***   0.004***   0.004***
##                                                                        (0.001)    (0.001)    (0.001)
## sulphates                                                                        0.420***   0.365***
##                                                                                   (0.095)    (0.096)
## pH                                                                                          0.277***
##                                                                                              (0.074)
## ----------------------------------------------------------------------------------------------------
## R-squared                    0.190      0.202      0.204      0.259      0.265      0.267      0.270
## adj. R-squared               0.190      0.202      0.204      0.258      0.264      0.267      0.269
## sigma                        0.797      0.791      0.790      0.763      0.760      0.758      0.757
## F                         1146.395    619.354    418.558    427.455    351.981    297.659    257.788
## p                            0.000      0.000      0.000      0.000      0.000      0.000      0.000
## Log-likelihood           -5839.391  -5802.158  -5795.291  -5620.674  -5602.034  -5592.328  -5585.395
## Deviance                  3112.257   3065.298   3056.715   2846.355   2824.773   2813.600   2805.646
## AIC                      11684.782  11612.317  11600.583  11253.347  11218.068  11200.655  11188.790
## BIC                      11704.272  11638.303  11633.066  11292.327  11263.544  11252.628  11247.259
## N                             4898       4898       4898       4898       4898       4898       4898
## ====================================================================================================
I thought I might try taking out some of the variables which didn’t seem to affect R-squared that much, but I’ve only decreased the effectiveness of the linear model.
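Since n7 uses a subset of the predictors in m10, the two can be compared with the usual F-test for nested linear models, which is what `anova()` would report if given both fits. This sketch rebuilds that test by hand from the Deviance rows of the two tables (for a linear model the deviance is the residual sum of squares); the variable names are mine.

```r
n <- 4898  # number of wines

rss_n7  <- 2805.646  # Deviance row, n7 (7 predictors)
rss_m10 <- 2758.359  # Deviance row, m10 (10 predictors)

df_extra <- 10 - 7       # predictors in m10 that n7 leaves out
df_resid <- n - (10 + 1) # residual degrees of freedom of m10

# F statistic for the three dropped predictors taken together.
f_stat <- ((rss_n7 - rss_m10) / df_extra) / (rss_m10 / df_resid)
p_value <- pf(f_stat, df_extra, df_resid, lower.tail = FALSE)

f_stat
p_value
```

The test says the dropped variables (density, fixed.acidity, total.sulfur.dioxide) do carry information as a group, which agrees with the drop in R-squared from 0.282 to 0.270, even though the practical difference is small.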
Coloring some of the graphs in the bivariate analysis section really highlighted the relationship between alcohol percentage and quality. I also created some nice visual representations of the relationships between some of the more similar variables, such as free and total sulfur dioxide and sulphates, and density, residual sugar, and alcohol.
Further, I went on to try creating a linear model for quality, but none of the models created are particularly effective at predicting the quality of the wine. Using nearly all the available variables created a model that explains about 28% of the variance in quality (R-squared of 0.282), but that isn’t high enough to consider using in a practical sense.
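For a feel of what that R-squared means, it can be rebuilt from numbers already shown: the total variation in quality comes from the quality counts, and the unexplained variation is the Deviance row of the model table. A sketch, with names I’m introducing:

```r
# Counts per quality score, copied from the table() output earlier.
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)
scores <- as.integer(names(quality_counts))
n <- sum(quality_counts)

# Total sum of squares of quality around its mean.
mean_q <- sum(scores * quality_counts) / n
tss <- sum(quality_counts * (scores - mean_q)^2)

# Residual sums of squares, copied from the Deviance row for m1 and m10.
rss <- c(m1 = 3112.257, m10 = 2758.359)

r2 <- 1 - rss / tss
round(r2, 3)

# Spread of quality before any modelling, to compare with sigma in the table.
sd_quality <- sqrt(tss / (n - 1))
round(sd_quality, 3)
```

The standard deviation of quality is about 0.89 to begin with, and the largest model only brings the residual standard error (sigma) down to 0.751, so most of the spread is left unexplained.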
This graph highlights the biggest issue with this dataset: there isn’t enough data. I think to really get a better understanding of what makes a good quality wine, the scale either needs to be continuous, or we need more high and low quality wines.
This plot does an excellent job of showing the relationship between alcohol percentage and quality. Perhaps not surprisingly, the more alcohol, the higher the quality.
Finally, the third graph really shows the relationship between residual sugar, alcohol, and density. While not doing much to contribute to the quality of the wine, I thought this was a neat visualization of one of the physical aspects of wine.
The wine dataset contained 4,898 observations of wine from various regions of Portugal. I started by getting a sense of the individual variables, before looking into their relationships with each other. As expected with any dataset, each variable had some large outliers. I built a correlation table both with and without the rows containing outlying values, but there wasn’t an appreciable difference in correlations. For the most part, outside of alcohol, no variables were particularly correlated with quality. There were some variables that were well correlated, such as alcohol and density, and free and total sulfur dioxide, but these tended to relate to actual physical relationships.
During my bivariate plotting, I did find some subtle relationships between the different variables and quality. For example, free and total sulfur dioxide tended to funnel towards an ideal amount for higher qualities. Higher qualities had less salt, volatile acid, and residual sugar. For the most part, however, there was no discernible relationship. Adding a colored variable highlighted this further.
For the most part, alcohol was the best indicator of wine quality, with more alcohol meaning higher quality. I’d like to joke that the wine judges liked wines that got you inebriated faster, but some cursory research showed that higher alcohol content adds a lot more complexity to texture and taste. If anything, the variability of wines in the middle tier showed that wine quality is a very complex thing.
I tried creating a linear model to predict quality, but the results, though significant, had very low R-squared values, leading me to believe that the models might not hold up well in predicting a wine’s quality.
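One concrete way to see the limitation is to plug the observed alcohol range into the simplest model. This sketch uses the m1 coefficients from the model table and the minimum and maximum alcohol from the summary output; `predict_m1` is a hypothetical helper that just evaluates the fitted line.

```r
# m1 from the model table: quality = 2.582 + 0.313 * alcohol
predict_m1 <- function(alcohol) 2.582 + 0.313 * alcohol

# Lowest and highest alcohol values in the dataset.
alcohol_range <- c(min = 8.0, max = 14.2)

predicted <- predict_m1(alcohol_range)
predicted
round(predicted)  # nearest whole quality score
```

Even at the extremes of alcohol content, the alcohol-only model never strays outside the 5 to 7 band, so it cannot produce a 3, 4, 8, or 9 at all.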
I think the biggest detractor from this dataset is that there isn’t enough data on very low and very high quality wines. If there were some more wines in those categories, then perhaps some of the relationships might be more revealing. Further, the data might require a more sophisticated modelling method to derive any meaningful results, and I don’t have the knowledge to do that quite yet. This has been fun, and it’s a lovely day here in Southern California, and now I really want to go have some wine!
Wine Quality info Summary: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Vinho Verde website, for more on the wines used in the dataset: http://www.vinhoverde.pt/en/homepage
UC Davis Waterhouse Lab, “What’s In Wine”: http://waterhouse.ucdavis.edu/whats-in-wine
Sulfur Dioxide wikipedia: https://en.wikipedia.org/wiki/Sulfur_dioxide
Wine Mouthfeel and Texture: https://wine.appstate.edu/sites/wine.appstate.edu/files/Wine%20Mouthfeel%20and%20Texture.pdf